

Search for: All records

Creators/Authors contains: "Forthmann, Boris"


  1. Automated scoring is a current hot topic in creativity research. However, most research has focused on the English language and on popular verbal creative thinking tasks, such as the alternate uses task. In this study, we therefore present a large language model approach for automated scoring of a scientific creative thinking task that assesses divergent ideation in experimental tasks in the German language. Participants are required to generate alternative explanations for an empirical observation. This work analyzed a total of 13,423 unique responses. To predict human ratings of originality, we used XLM-RoBERTa (Cross-lingual Language Model-RoBERTa), a large multilingual model. The prediction model was trained on 9,400 responses. Results showed a strong correlation between model predictions and human ratings in a held-out test set (n = 2,682; r = 0.80; 95% CI [0.79, 0.81]). These promising findings underscore the potential of large language models for automated scoring of scientific creative thinking in the German language. We encourage researchers to further investigate automated scoring of other domain-specific creative thinking tasks. (An illustrative code sketch of this approach follows the results list.)
  2. Free, publicly-accessible full text available March 13, 2026
  3. Human ratings are ubiquitous in creativity research. Yet rating responses to creativity tasks, typically several hundred or even thousands of responses per rater, is often time-consuming and expensive. Planned missing data designs, in which raters rate only a subset of the total number of responses, have recently been proposed as one possible way to decrease overall rating time and monetary costs. However, researchers also need ratings that meet psychometric standards, such as a certain degree of reliability, and psychometric work on planned missing designs is currently lacking in the literature. In this work, we introduce how judge response theory and simulations can be used to fine-tune the planning of missing data designs. We provide open code for the community and illustrate the proposed approach with a cost-effectiveness calculation based on a realistic example. We clearly show that fine-tuning helps to save rating time and monetary costs while simultaneously targeting expected levels of reliability. (An illustrative simulation sketch follows the results list.)
  4. Scoring divergent thinking tasks opens multiple avenues and possibilities, and thus decisions that researchers have to make. While some scholars argue that scoring should focus on the best ideas provided, measuring only the best responses (e.g., "top scoring") comes with challenges. Specifically, compared to the average quality across all responses, top scoring uses less information (the "bad" ideas are discarded), which decreases reliability. To resolve this issue, this article introduces a multidimensional top-scoring approach, analogous to linear growth modeling, that retains the information provided by all responses (best ideas and "bad" ideas). Across two studies, using both subjective human ratings and semantic distance originality scoring of responses to over a dozen divergent thinking tasks, we demonstrated that Maximum (the best idea) and Top2 scoring (the two best ideas) can surpass the typically applied average scoring in measurement precision when the originality of the "bad" ideas is used as auxiliary information (i.e., additional information in the analysis). We therefore recommend retaining all ideas when scoring divergent thinking tasks, and we discuss the potential this new approach holds for creativity research and practice. (An illustrative scoring sketch follows the results list.)
  5. Creativity research often relies on human raters to judge the novelty of participants' responses to open-ended tasks, such as the Alternate Uses Task (AUT). Although useful, manual ratings are subjective and labor-intensive. To address these limitations, researchers increasingly use automatic scoring methods based on natural language processing techniques that quantify the semantic distance between words. However, many methodological choices remain open regarding how to obtain semantic distance scores for ideas, and these choices can significantly affect reliability and validity. In this project, we propose a new semantic distance-based method, maximum associative distance (MAD), for assessing response novelty in the AUT. Within a response, MAD uses the semantic distance of the word that is maximally remote from the prompt word to reflect response novelty. We compare MAD with competing semantic distance-based methods, including element-wise multiplication (a commonly used compositional model), across three published datasets comprising a total of 447 participants. We found MAD to be more strongly correlated with human creativity ratings than the competing methods. In addition, MAD scores reliably predict external measures such as openness to experience. We further explored how idea elaboration affects the performance of the various scoring methods and found that MAD aligns closely with human raters in processing multi-word responses. The MAD method thus improves the psychometrics of semantic distance for automatic creativity assessment, and it provides clues about what human raters find creative about ideas. (An illustrative MAD sketch follows the results list.)
  6. Semantic distance scoring provides an attractive alternative to other approaches for scoring responses in creative thinking tasks, and evidence in its support has increased over the last few years. One recent approach proposes combining multiple semantic spaces to better balance the idiosyncratic influences of each space, so that the final semantic distance score for each response is represented by a composite or factor score. However, semantic spaces are not necessarily equally weighted in mean scores, and the use of factor scores requires high levels of factor determinacy (i.e., the correlation between estimated and true factor scores). Hence, in this work, we examined the weightings underlying mean scores, mean scores of standardized variables, factor loadings, weights that maximize reliability, and equally effective weights on common verbal creative thinking tasks. Both empirical and simulated factor determinacy, as well as Gilmer-Feldt composite reliability, were mostly good to excellent (i.e., > .80) across two task types (Alternate Uses and Creative Word Association), eight samples of data, and all weighting approaches. Person-level validity findings were also highly comparable across weighting approaches. Nuances and challenges of the different weightings, and the question of using composites versus factor scores, are discussed in detail. (An illustrative weighting sketch follows the results list.)
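Sketch for result 1. A minimal illustration of the kind of regression fine-tuning described there: XLM-RoBERTa with a single-output head trained to predict human originality ratings. The checkpoint name, toy German responses, rating values, and hyperparameters are illustrative assumptions, not details reported in the paper.

```python
# Hedged sketch: fine-tuning a multilingual XLM-RoBERTa regression head to predict
# human originality ratings for German-language responses. Texts, ratings, and
# hyperparameters are hypothetical placeholders.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class ResponseDataset(Dataset):
    """Pairs free-text responses with their human originality ratings."""
    def __init__(self, texts, ratings, tokenizer, max_length=64):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=max_length, return_tensors="pt")
        self.ratings = torch.tensor(ratings, dtype=torch.float)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()} | {"labels": self.ratings[i]}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1, problem_type="regression")

# toy data standing in for the ~9,400 training responses
texts = ["Die Pflanze wuchs schneller, weil der Boden waermer war.",
         "Ein Messfehler hat die Beobachtung verursacht."]
ratings = [2.4, 1.1]

loader = DataLoader(ResponseDataset(texts, ratings, tokenizer),
                    batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        out = model(**batch)        # MSE loss against the human ratings
        out.loss.backward()
        optimizer.step()
```

In practice, model predictions on a held-out set would then be correlated with human ratings, as in the reported r = 0.80.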
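Sketch for result 3. A small Monte Carlo check of the planned-missing-data idea: each response is rated by only a subset of raters, and simulation shows how well the available ratings recover a simulated true originality score. The judge model, rater parameters, and design values are illustrative assumptions, not the specifications used in the paper.

```python
# Hedged sketch: simulate a planned missing rating design in which each response
# is rated by only k of J raters, then check how well the mean of the available
# ratings recovers the simulated "true" originality. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_responses, n_raters, k = 1000, 6, 2      # each response rated by 2 of 6 raters

true_originality = rng.normal(size=n_responses)
severity = rng.normal(scale=0.3, size=n_raters)       # rater-specific shift
error_sd = 0.7                                         # within-rater noise

# full rating matrix under a simple judge model
ratings = true_originality[:, None] + severity[None, :] + \
          rng.normal(scale=error_sd, size=(n_responses, n_raters))

# planned missingness: keep k randomly assigned raters per response
mask = np.zeros((n_responses, n_raters), dtype=bool)
for i in range(n_responses):
    mask[i, rng.choice(n_raters, size=k, replace=False)] = True
observed = np.where(mask, ratings, np.nan)

score = np.nanmean(observed, axis=1)                   # mean of available ratings
recovery = np.corrcoef(score, true_originality)[0, 1]
print(f"correlation with true originality (k={k} of {n_raters} raters): {recovery:.2f}")
```

Repeating such a simulation over candidate designs (varying k, the number of raters, and rater parameters) is one way to trade rating costs against expected reliability before data collection.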
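Sketch for result 4. A brief contrast of average, Maximum (best idea), and Top2 scoring of one participant's responses. The article's approach is model-based (analogous to linear growth modeling) and retains all responses as auxiliary information; the direct calculations below only illustrate the basic scoring definitions, with made-up originality values.

```python
# Hedged sketch: average vs. top scoring of one participant's per-response
# originality values. Values are made up; the article's multidimensional
# top-scoring approach is model-based rather than these direct calculations.
import numpy as np

def summary_scores(originality):
    """Average, best, and mean-of-two-best originality for one participant/task."""
    vals = np.sort(np.asarray(originality, dtype=float))[::-1]   # descending
    return {
        "average": vals.mean(),
        "maximum": vals[0],
        "top2": vals[:2].mean() if len(vals) >= 2 else vals[0],
    }

print(summary_scores([1.2, 3.4, 0.8, 2.1]))
# {'average': 1.875, 'maximum': 3.4, 'top2': 2.75}
```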
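Sketch for result 5. The maximum associative distance (MAD) idea: a multi-word response is scored by the semantic distance of whichever response word is farthest from the prompt word. The GloVe space and the example response below are stand-ins chosen for illustration, not the semantic spaces or data used in the study.

```python
# Hedged sketch of MAD scoring: take the maximum cosine distance between the
# prompt word and any in-vocabulary word of the response. The GloVe vectors are
# an illustrative stand-in for the semantic spaces used in this line of work.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")   # downloads the vectors on first use

def mad_score(prompt, response, vectors):
    """Max cosine distance between the prompt word and any in-vocabulary response word."""
    words = [w for w in response.lower().split() if w in vectors and w != prompt]
    if not words:
        return np.nan
    distances = [1.0 - vectors.similarity(prompt, w) for w in words]
    return max(distances)

print(mad_score("brick", "use it as a tiny makeshift bookshelf", vectors))
```

Compositional alternatives (e.g., element-wise multiplication of the response-word vectors before computing distance) pool all words into one vector, whereas MAD keys on the single most remote word.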
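Sketch for result 6. Why weighting matters when scores from several semantic spaces are pooled: a plain mean implicitly weights each space by its score variance, while standardizing first gives nominally equal weights. The three "spaces" are simulated columns on deliberately different scales, purely for illustration.

```python
# Hedged sketch: implicit weighting of semantic spaces in a raw mean vs. a mean
# of standardized scores. The three "spaces" are random illustrative columns,
# not the spaces analyzed in the paper.
import numpy as np

rng = np.random.default_rng(7)
n = 200
scores = np.column_stack([
    rng.normal(0.80, 0.05, n),   # space A: tight scale
    rng.normal(0.60, 0.20, n),   # space B: wide scale
    rng.normal(0.70, 0.10, n),   # space C
])

raw_mean = scores.mean(axis=1)                                      # unequal implicit weights
z_mean = ((scores - scores.mean(0)) / scores.std(0)).mean(axis=1)   # equal nominal weights

# which spaces dominate each composite (correlation of composite with each space)
for name, comp in [("raw mean", raw_mean), ("z-score mean", z_mean)]:
    r = [np.corrcoef(comp, scores[:, j])[0, 1] for j in range(3)]
    print(name, np.round(r, 2))
```

Factor scores, reliability-maximizing weights, and the other schemes named in the abstract are alternative ways of choosing these weights; their usefulness depends on composite reliability and factor determinacy, as the paper examines.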